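After all that introduction, it is finally time to write a small, fun program. Below I show how to write a simple web crawler in Python. In this crawler you can choose the board, the number of pages, and the keywords you want to crawl.
This crawler uses the following modules: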
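requests
Sends HTTP requests to a web server and downloads the requested data.
re
Python's regular expression module.
bs4
The BeautifulSoup class in bs4 quickly parses a page's HTML source.
Here is the code first: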
# -*- coding: utf-8 -*-
import requests
import re
from bs4 import BeautifulSoup

# Example boards: THSRshare / NBA / forsale / mobilesales
Board = 'mobilesales'
key_article = "iphone"  # separate multiple keywords with a comma ","
# key_content = "螢幕"
num_pages = 5
ptt_url = 'https://www.ptt.cc'
url = ptt_url + '/bbs/' + Board + '/index.html'
selected_url = []

for i in range(1, num_pages + 1):
    print(f"Page {i} : ")
    web = requests.get(url)  # request the current index page
    soup = BeautifulSoup(web.text, 'html.parser')  # parse the page source
    articles = soup.select('div.title a')  # article title links
    paging = soup.select('div.btn-group-paging a')  # paging buttons
    next_url = ptt_url + paging[1]['href']  # paging[1] is the "previous page" button
    url = next_url  # the next iteration crawls the previous (older) page
    for article in articles:
        for key in key_article.split(","):
            if article.text.find(key.strip()) != -1:
                print(article.text, ptt_url + article['href'])
                selected_url.append(ptt_url + article['href'])
                break  # stop at the first matching keyword to avoid duplicates
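
The keyword check above uses str.find, which is case-sensitive, so "iphone" will not match a title written as "iPhone". As a rough sketch of one alternative, the re module could do the matching case-insensitively. The function name filter_titles and the sample titles below are made up for illustration, and the (title, href) pairs stand in for the links returned by soup.select('div.title a'), so the example runs without any network access.

```python
import re
from urllib.parse import urljoin

PTT_URL = "https://www.ptt.cc"  # same base URL as in the listing above


def filter_titles(titles, keywords):
    """Return (title, absolute_url) pairs whose title contains any keyword.

    `titles` is a list of (title, href) pairs, a hypothetical stand-in for
    the <a> tags that soup.select('div.title a') would return.
    `keywords` is a comma-separated string, like key_article above.
    """
    keys = [k.strip() for k in keywords.split(",") if k.strip()]
    if not keys:
        return []  # an empty pattern would match every title
    # re.escape keeps characters such as "+" or "." literal
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)
    return [(title, urljoin(PTT_URL, href))
            for title, href in titles
            if pattern.search(title)]


# Made-up sample data, for illustration only
sample = [
    ("[賣/台北] iPhone 13 128G", "/bbs/mobilesales/M.1.A.001.html"),
    ("[徵/高雄] Pixel 7", "/bbs/mobilesales/M.2.A.002.html"),
    ("[賣/台中] iphone SE", "/bbs/mobilesales/M.3.A.003.html"),
]

for title, link in filter_titles(sample, "iphone, ipad"):
    print(title, link)
```

Because all keywords are combined into a single pattern, each title is tested once and can be selected at most once, however many keywords it contains.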